Bootstrapping for Text Learning Tasks
Authors
Abstract
When applying text learning algorithms to complex tasks, it is tedious and expensive to hand-label the large amounts of training data necessary for good performance. This paper presents bootstrapping as an alternative approach to learning from large sets of labeled data. Instead of a large quantity of labeled data, this paper advocates using a small amount of seed information and a large collection of easily-obtained unlabeled data. Bootstrapping initializes a learner with the seed information; it then iterates, applying the learner to calculate labels for the unlabeled data, and incorporating some of these labels into the training input for the learner. Two case studies of this approach are presented. Bootstrapping for information extraction provides 76% precision for a 250-word dictionary for extracting locations from web pages, when starting with just a few seed locations. Bootstrapping a text classifier from a few keywords per class and a class hierarchy provides accuracy of 66%, a level close to human agreement, when placing computer science research papers into a topic hierarchy. The success of these two examples argues for the strength of the general bootstrapping approach for text learning tasks.
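The loop described in the abstract (initialize from seed information, label the unlabeled data, fold some of those labels back into the training input, repeat) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the small naive Bayes classifier, the margin-based confidence rule, the function names, the `rounds` and `per_round` parameters, and the toy documents and seed keywords are all hypothetical.

```python
import math
from collections import Counter

def train_naive_bayes(labeled, vocab, classes):
    """Fit multinomial naive Bayes with add-one smoothing on (tokens, label) pairs."""
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter()
    for tokens, label in labeled:
        word_counts[label].update(tokens)
        doc_counts[label] += 1
    model = {}
    for c in classes:
        total = sum(word_counts[c].values())
        log_prior = math.log((doc_counts[c] + 1) / (len(labeled) + len(classes)))
        log_like = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                    for w in vocab}
        model[c] = (log_prior, log_like)
    return model

def classify(model, tokens):
    """Return (best_class, margin); margin is the log-score gap to the runner-up.

    Assumes at least two classes.
    """
    scores = {c: prior + sum(like[w] for w in tokens if w in like)
              for c, (prior, like) in model.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[0], scores[ranked[0]] - scores[ranked[1]]

def bootstrap(docs, seeds, rounds=5, per_round=2):
    """Grow a labeled set from seed keywords plus unlabeled documents."""
    classes = sorted(seeds)
    tokenized = {d: d.lower().split() for d in docs}
    vocab = {w for toks in tokenized.values() for w in toks}

    # Seed step: label documents that match keywords of exactly one class.
    labels = {}
    for d, toks in tokenized.items():
        hits = [c for c in classes if seeds[c] & set(toks)]
        if len(hits) == 1:
            labels[d] = hits[0]

    for _ in range(rounds):
        unlabeled = [d for d in docs if d not in labels]
        if not unlabeled:
            break
        # Retrain the learner on everything labeled so far.
        model = train_naive_bayes(
            [(tokenized[d], c) for d, c in labels.items()], vocab, classes)
        # Label the unlabeled data and rank the predictions by confidence.
        scored = [(d, *classify(model, tokenized[d])) for d in unlabeled]
        scored.sort(key=lambda item: item[2], reverse=True)
        # Incorporate only some labels: the most confident, non-tied ones.
        confident = [(d, c) for d, c, margin in scored[:per_round] if margin > 0]
        if not confident:
            break
        labels.update(confident)
    return labels

# Toy data (hypothetical): two seed keywords, six short "documents".
docs = [
    "neural network training",
    "compiler parser grammar",
    "network training gradient",
    "parser grammar syntax",
    "gradient descent",
    "syntax tree",
]
seeds = {"ml": {"neural"}, "pl": {"compiler"}}
labels = bootstrap(docs, seeds)
```

In this toy run, only two documents match a seed keyword directly. The remaining four are labeled over two rounds, because words learned from earlier confident predictions ("gradient", "syntax") make later documents classifiable. Adding only the highest-margin predictions each round is one simple way to limit the spread of early labeling mistakes.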
Related resources
Text classification from unlabeled documents with bootstrapping and feature projection techniques
Many machine learning algorithms have been applied to text classification tasks. In the machine learning paradigm, a general inductive process automatically builds a text classifier by learning, generally known as supervised learning. However, the supervised learning approaches have some problems. The most notable problem is that they require a large number of labeled training documents for acc...
A Comparison Of Efficacy And Assumptions Of Bootstrapping Algorithms For Training Information Extraction Systems
Information Extraction systems offer a way of automating the discovery of information from text documents. Research and commercial systems use considerable training data to learn dictionaries and patterns to use for extraction. Learning to extract useful information from text data using only minutes of user time means that we need to leverage unlabeled data to accompany the small amount of labe...
Bootstrapping Biomedical Ontologies for Scientific Text using NELL
We describe an open information extraction system for biomedical text based on NELL (the Never-Ending Language Learner) (Carlson et al., 2010), a system designed for extraction from Web text. NELL uses a coupled semi-supervised bootstrapping approach to learn new facts from text, given an initial ontology and a small number of “seeds” for each ontology category. In contrast to previous applicat...
Mining Diagnostic Text Reports by Learning to Annotate Knowledge Roles
Several tasks approached by using text mining techniques, like text categorization, document clustering, or information retrieval, operate on the document level, making use of the so-called bag-of-words model. Other tasks, like document summarization, information extraction, or question answering, have to operate on the sentence level, in order to fulfill their specific requirements. While bot...
Investigating Unsupervised Learning for Text Categorization Bootstrapping
We propose a generalized bootstrapping algorithm in which categories are described by relevant seed features. Our method introduces two unsupervised steps that improve the initial categorization step of the bootstrapping scheme: (i) using Latent Semantic space to obtain a generalized similarity measure between instances and features, and (ii) the Gaussian Mixture algorithm, to obtain uniform cl...